Introduction

ISIS’ ability to build and maintain a large online community that disseminates propaganda and garners support continues to give its message global reach. Although these communities contain trained media cadre, recent literature suggests that ISIS’ unprecedented online success is largely explained by large numbers of “unaffiliated sympathizers” who simply re-tweet or re-post propaganda. Tailored methodologies to detect and study these online extremist communities (OECs) could help provide the understanding needed to craft effective counter-narratives; however, continued development of these methods will require collaboration between methodologists and regional experts. This tutorial will enable a researcher to detect an OEC embedded in a large online social network through supervised machine learning methods. A more detailed explanation of the methods presented here, as well as illustrative intelligence extractions, is provided in the following works:

Problem Statement

In general terms, this methodology is designed to detect large online extremist communities given a handful of known examples. In previous work we have used many different sampling strategies, but we find that snowball sampling known members’ following ties typically returns useful results. Snowball sampling is a non-random sampling technique in which a set of individuals is chosen as “seed agents.” The k most frequent accounts followed by each seed agent are taken as members of the sample. This technique can be iterated in hops. For example, in a two-hop snowball sample of users’ following ties, we would take the union of our seed agents’ following ties to define the seed agents for hop 2. Although this technique is non-random and prone to bias, it is often used when trying to sample hidden populations. It is also worth noting that it tends to return very large, noisy data sets due to users’ simultaneous membership in many online communities. For example, in Benigni et al. (2016), a two-hop snowball sample of 5 known ISIS propagandists’ following ties yielded over 120,000 accounts. The majority of those accounts are not of interest, and standard network clustering techniques often fail to identify extremist communities with adequate precision. We therefore need to partition this set of users into two groups: OEC members and non-members.
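The one-hop sampling step described above can be sketched in base R. The toy `following` list and the helper name `snowball_hop` below are illustrative, standing in for data that would come from the Twitter API:

```r
# Toy following lists: each element maps a user to the accounts they follow.
following <- list(
  seed1 = c("a", "b", "c"),
  seed2 = c("b", "c", "d"),
  seed3 = c("c", "d", "e")
)

snowball_hop <- function(seeds, following, k) {
  # Tally how often each account is followed across the seed set,
  # then keep the k most frequently followed accounts.
  followed <- unlist(following[seeds], use.names = FALSE)
  counts <- sort(table(followed), decreasing = TRUE)
  names(counts)[seq_len(min(k, length(counts)))]
}

hop1 <- snowball_hop(c("seed1", "seed2", "seed3"), following, k = 2)
# For a two-hop sample, hop1 would become the seed set for the next iteration.
```

In a real collection, `following` would be replaced by API calls per seed agent, and the union of hop-1 results would seed hop 2.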

Overview

To accomplish this we will conduct our analysis in four phases. Due to time and computational constraints, we will assume both collection and initial feature-space development have already taken place. I will explain our data sources and search techniques in the Data section. We find that discriminating between accounts that are highly central within our extremist community of interest and accounts that are globally central requires a classifier. Therefore, in Phase I we will remove accounts with anomalously high following and mention counts, such as celebrities and politicians. In Phase II I will provide output from unsupervised methods, yet to be published, that can be used to rapidly develop a training set; more detail on these methods will be forthcoming. In Phase III we will train a classifier to detect our extremist community of interest, and in Phase IV we will explore this community using ORA.

We start with a search set of just over 91,000 accounts, seeded with 4 highly central figures within the Euromaidan movement. I do not imply that this group meets the criteria to be labelled an ‘extremist community’; however, its use of mentions, hashtags, and multimedia sharing closely resembles the social media activity of an OEC. We first remove roughly 8,000 official accounts, then quickly find 500 positive case instances via unsupervised dense community detection. Finally, we train a classifier to detect over 4,000 members of the Twitter community that actively posts about the Euromaidan movement.[INSERT LINK]

This tutorial is designed to be executed with its accompanying R script “tutorial_execution_script.R”. Please download the source code and data provided INSERT FIGSHARE LINK. When you source the functions, the script will automatically create the file structure needed for further analysis and install all required package dependencies. The code and functions here have been developed on OS X, and I have conducted only limited testing on Windows; execution on Windows platforms is still somewhat unstable, and the WIN parameter in the denseFeatures() and reportFeatures() functions is needed when running in a Windows environment. This tutorial was developed on a MacBook Pro with 8 GB of RAM and required nearly all of it when processing the accompanying data set. Feedback on issues is welcome.

Data

Within this tutorial, we will assume the user has already collected their search set from the Twitter API and constructed the following files:

These files can be generated with the Python library twitter_dm. For the purpose of this tutorial, these files have already been uploaded, cleaned, and transformed into a feature space for classification. Source code for these steps will be available upon publication of the works cited above.

Data Representation and Feature Space Development

Our initial task is to develop a feature space that leverages each of the user behaviors captured in the attached files. Ultimately we wish to represent each user as a row with numerical features as columns. The challenge lies in including both node-level features and network features (i.e., user interaction). We do so by extracting leading eigenvectors from undirected representations of the following and mention networks. Additionally, we extract leading left singular vectors from the singular value decomposition (SVD) of the user-by-hashtag network. Due to time and computational constraints this step is not covered in this tutorial; instead, we provide an RData file with the initial feature space. Before starting, we need to source the functions provided for the tutorial and install all associated packages. Chunk 1 performs these tasks for you.
Chunk 1: install packages and load functions

# install any missing package dependencies
list.of.packages <- c("data.table","bit64","R2HTML","caret","randomForest","shiny")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

# load the packages used throughout the tutorial
library(shiny)
library(data.table)
library(bit64)
library(caret)
library(randomForest)
library(R2HTML)
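To make the feature-space construction described above concrete, here is a small base-R sketch on synthetic data: a leading eigenvector from a symmetrized (undirected) mention network, and leading left singular vectors from a user-by-hashtag matrix. All data here are made up, and the packaged feature space was built by separate code at much larger scale:

```r
set.seed(1)
# Tiny synthetic directed mention network among 5 users.
A <- matrix(rbinom(25, 1, 0.4), 5, 5)
A <- pmax(A, t(A)); diag(A) <- 0          # symmetrize: undirected representation

lead_eig <- eigen(A, symmetric = TRUE)$vectors[, 1]   # leading eigenvector

# Tiny synthetic user-by-hashtag count matrix (5 users x 3 hashtags).
H <- matrix(rpois(15, 1), nrow = 5)
sv <- svd(H)
user_svd_features <- sv$u[, 1:2]          # leading left singular vectors

features_sketch <- data.frame(userID = 1:5,
                              eig1 = lead_eig,
                              svd1 = user_svd_features[, 1],
                              svd2 = user_svd_features[, 2])
```

Each user becomes one row; network structure enters the feature space through the eigenvector and singular-vector columns.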


Phase 1: Remove Celebrities, Politicians, and other Highly Followed Accounts


A community in social media is simply a set of users whose connections within the group are denser than their connections outside the group. One problem, however, is that some users have an inordinate number of mentions and followers. Celebrities, politicians, and news media tend to be falsely detected as OEC members because of these connections, so we must first train a classifier to remove them. I refer to these collectively as “official accounts”, since they span many different account types. We use a list of such accounts to develop train and test sets and subsequently train a classifier.

Function: TrainTest_SetBuilder()

Description

Divides labeled data into train and test sets via random sampling, and provides a third set of the remaining unlabeled data for subsequent classification.

Input

  • data - a data frame of features, usually output from classifierFeatures()
  • PositiveLabels - a vector of userIDs to be used as positive case labels
  • NegativeLabels - a vector of userIDs to be used as negative case labels
  • p.test.split - numerical value between 0 and 1 that defines the proportion of labeled data to be held out for evaluation (i.e. the train/test split).

Value

Returns a list of 3 data frames (training users, test users, and unlabeled users). This output is designed to be subsequently used with the classifier() function.
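The split this function performs can be sketched as follows. This is illustrative only (the helper name `split_labeled` and toy data are mine); the packaged function also carries the full feature columns and output naming:

```r
# Divide labeled users into train/test sets and collect the unlabeled rest.
split_labeled <- function(data, pos, neg, p.test.split) {
  labeled <- data[data$userID %in% c(pos, neg), ]
  labeled$class <- ifelse(labeled$userID %in% pos, "pos", "neg")
  test_idx <- sample(nrow(labeled), round(p.test.split * nrow(labeled)))
  list(train     = labeled[-test_idx, ],
       test      = labeled[test_idx, ],
       unlabeled = data[!(data$userID %in% c(pos, neg)), ])
}

toy  <- data.frame(userID = 1:10, x = rnorm(10))
sets <- split_labeled(toy, pos = 1:3, neg = 4:6, p.test.split = 0.4)
# 6 labeled users: 4 train, 2 test; the remaining 4 users are unlabeled.
```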

Function: classifier()

Description

Trains and applies various machine learning classifiers on output from TrainTest_SetBuilder(). The function is designed to be used easily for either development or application of your classifier.

Input

  • data_list - output from TrainTest_SetBuilder()
  • algorithm - a character string specifying which classifier algorithm to apply; choices are ‘randomForest’, ‘svmRadial’, ‘JRip’, and ‘J48’. See the packages ‘caret’ and ‘randomForest’ for more detail. We find ‘randomForest’ to be the best option in terms of performance and speed.
  • evaluation - TRUE or FALSE. If TRUE the classifier is trained on the training set and applied and evaluated on the test set. If FALSE, a final model is trained using all labeled data and applied to your unlabeled data.
  • ratio - the vote proportion required for the winning class in a randomForest classifier. Values close to 0 stress precision, while values close to 1 stress recall.
  • t - a character tag that is added to file output for naming convention.

Value

When evaluation is set to TRUE, the function returns a model object and performance estimates. When evaluation is set to FALSE, the function returns a list of userIDs predicted as positive case.
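The precision/recall trade-off that the ratio parameter controls can be sketched by thresholding a classifier's positive-class vote share at different cutoffs. The vote shares and ground truth below are made up, and how ratio maps onto a cutoff inside the packaged function is not shown here:

```r
votes <- c(0.30, 0.55, 0.70, 0.90)   # hypothetical positive-class vote shares
truth <- c(FALSE, FALSE, TRUE, TRUE) # hypothetical ground truth

predict_at <- function(votes, cutoff) votes >= cutoff

lenient <- predict_at(votes, 0.5)    # more predicted positives, better recall
strict  <- predict_at(votes, 0.8)    # fewer, more confident positives

precision <- function(pred, truth) sum(pred & truth) / sum(pred)
recall    <- function(pred, truth) sum(pred & truth) / sum(truth)
# Here the strict cutoff gives perfect precision but misses one true case,
# while the lenient cutoff recovers both true cases at lower precision.
```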



Chunk 2: load data and functions and develop official account classifier

load(file.path('RData','Phase_1_Features.RData'))
set.seed(25)

# read the list of known official accounts and keep those present in the search set
officialIDs=fread(file.path('input','officialIDs.csv'),integer64='numeric',data.table=FALSE)$V1
posLabels=officialIDs[officialIDs %in% features$userID]

# sample an equal number of negative cases from accounts with < 100,000 followers
negLabels=sample(features$userID[features$followerCount<100000],length(posLabels))
officialTrainTest=TrainTest_SetBuilder(data=features,PositiveLabels=posLabels,NegativeLabels=negLabels,p.test.split=.4)

r=.5 # a lower r emphasizes precision over recall
classifier(data_list=officialTrainTest,
           algorithm='randomForest',
           evaluation=TRUE,
           metric='Kappa',
           ratio=r,
           t='official')



Evaluation metrics are returned to the user, and feature selection can be used to refine the classifier. For this exercise we will accept the baseline results provided by chunk [INSERT]. If you wish to adjust the features used in this classifier, open the file “featureSetNames.csv” in the “input” folder and assign 1 next to features you wish to use and 0 next to features you wish to ignore. This file is typically generated while the feature space is developed, and source code will be available upon publication.
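Applying such a 0/1 selection file can be sketched as follows. The column names `feature` and `use` are my assumption; inspect featureSetNames.csv in the “input” folder for the actual layout:

```r
# Toy stand-in for featureSetNames.csv: flag each feature in or out.
feature_flags <- data.frame(feature = c("eig1", "svd1", "followerCount"),
                            use     = c(1, 1, 0),
                            stringsAsFactors = FALSE)

# Keep only the features flagged with 1.
keep <- feature_flags$feature[feature_flags$use == 1]
# features <- features[, c("userID", keep)]   # subset before re-training
```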
To train and apply your classifier in chunk INSERT, we change the evaluation parameter to FALSE. The classifier is now trained on all labeled data and applied to our unlabeled data. Random samples of cases labeled “official” and cases not labeled “official” are provided in the file “official_Model_Validation.html” in the “output” folder and can be quickly inspected in your web browser.

Chunk 3: train the official classifier and remove official accounts

set.seed(25)
official.predicted =classifier(data_list=officialTrainTest,
                               algorithm='randomForest',
                               evaluation=FALSE,
                               metric='Kappa',
                               ratio=r, # *** important parameter
                               t='official')

officialIDs=c(official.predicted,posLabels)
features=features[!(features$userID %in% officialIDs),]
save(features,file=file.path('RData','Phase_2_Features.RData'))
rm('negLabels','official.predicted','officialIDs','officialTrainTest','posLabels','r','list.of.packages','new.packages')



Official accounts have now been identified for removal from our search set, and we can train our classifier to detect the OEC. It is also useful to check the classifier output to verify the types of accounts being removed; see Verify Classifier Output.





Phase 2: Growing Your Training Set

Ultimately we need to provide both positive and negative case labels for users within our search set. To do so, we extract dense communities in which users’ mention, following, and shared-hashtag networks all indicate homophily within the group. These dense communities often appear to be ideologically grouped and, to some extent, ‘extreme’ in their positions. This allows us to confidently use them as labelled instances when random sampling indicates a consistently positive or negative case population. For this tutorial we will assume these algorithms have already run, and we will explore the output in the accompanying Shiny application.
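The random spot check that justifies using a dense cluster as labels can be sketched as follows; the toy `denseOutput` below mirrors the userID and denseCluster columns used later in the script, but its contents are fabricated for illustration:

```r
set.seed(1)
# Toy stand-in for the dense-community output: 100 users in two clusters.
denseOutput <- data.frame(userID = 1:100,
                          denseCluster = sample(c("d-14", "d-2"), 100, replace = TRUE))

# Draw up to 20 members of one cluster for manual review.
cluster_members <- denseOutput$userID[denseOutput$denseCluster == "d-14"]
review_sample  <- sample(cluster_members, min(20, length(cluster_members)))
# Inspect these accounts by hand; if (nearly) all are positive cases, the
# entire cluster can be used as positive labels with reasonable confidence.
```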

Chunk 4: Update positive case labels and develop Euromaidan classifier

load(file.path('RData','dense_communities.RData'))
posLabels=denseOutput$userID[denseOutput$denseCluster=='d-14']
set.seed(25)
negLabels=sample(features$userID,length(posLabels))
negLabels=negLabels[!(negLabels %in% posLabels)]

euromaidanTrainTest=TrainTest_SetBuilder(data=features,PositiveLabels=posLabels,NegativeLabels=negLabels,p.test.split=.4)  
r=.5 # a lower r emphasizes precision over recall
classifier(data_list=euromaidanTrainTest,
           algorithm='randomForest',
           evaluation=TRUE,
           metric='Kappa',
           ratio=r, # *** important parameter 
           t='euromaidan')


Once the algorithm, parameters, and feature space have been selected, we detect the greater community by training and applying the classifier.

Phase 3: Detect the Online Community of Interest

Now we build our feature space with official accounts removed and develop the positive and negative examples with which to train the classifier that detects the community of interest. The process is identical to the one used in Phase I; we are simply using different labels and a subset of our original search results.



Chunk 6: train and apply the Euromaidan classifier, then assemble the final community output

set.seed(25)
euromaidan.predicted =classifier(data_list=euromaidanTrainTest,
                               algorithm='randomForest',
                               evaluation=FALSE,
                               metric='Kappa',
                               ratio=r, # *** important parameter
                               t='euromaidan')

euromaidanIDs=c(euromaidan.predicted,posLabels)
features=features[features$userID %in% euromaidanIDs,]
df=denseOutput[,c('userID','denseCluster')]
finalOutput=merge(df,features,by='userID',all.y=TRUE,sort=FALSE)
finalOutput$ScreenName=hyperlink(finalOutput$ScreenName)
save(finalOutput,file=file.path('RData','euromaidan_community.RData'))



Phase 4: Analysis

Detecting these communities at scale enables us to apply social network analysis to the community and identify key users and group sub-structure. For this portion of the tutorial we will use ORA; a student version can be downloaded at the following link.
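Before moving into ORA, the detected community's ties must be exported in a form ORA can import, such as a CSV edge list. The sketch below writes an illustrative mention-network edge list to a temporary directory; the `mentions` data frame is built by hand here, standing in for edges derived from your collected data:

```r
# Illustrative mention-network edge list for the detected community.
mentions <- data.frame(source = c("userA", "userB", "userB"),
                       target = c("userB", "userC", "userA"),
                       weight = c(3, 1, 2))

out_file <- file.path(tempdir(), "euromaidan_mentions.csv")
write.csv(mentions, out_file, row.names = FALSE)
# This file can then be imported into ORA as a weighted directed network.
```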